The present invention is directed to a method of deterring the illicit copying of electronically published
documents. It includes utilizing a computer system to electronically publish a plurality of copies of a document
having electronically created material thereon for distribution to a plurality of subscribers and operating
programming within the computer system so as to perform the identification code functions. The steps are to
encode the plurality of copies each with a separate, unique identification code, the identification code being
based on a unique arrangement of the electronically created material on each such copy; and, creating a
codebook to correlate each such identification code to a particular subscriber. In some embodiments, decoding
methods are included with the encoding capabilities. The unique arrangement of the electronically created
material may be based on line-shift coding, word-shift coding, or feature enhancement coding (or combinations
of these) and may be effected through bitmap alteration or document format file alteration.

Field of the Invention

The present invention involves methods of deterring illicit copying of electronically published documents
by creating unique identification codes specific to each subscriber. Each copy of the published
document has a unique arrangement of electronically created material, e.g. print material or display
material, which is not quickly discernible to the untrained human eye. These unique identification codes
discourage illicit copying and enable a publisher/copyright owner to analyze illicit copies to determine the
source subscriber.

Detailed Background

When the quality of reproductions from copy machines became comparable with the original, the cost 
of copies was reduced to a few pennies per page, and the time it took to copy a page was reduced to a 
second or less, then copy machines started to present a threat to publishers. The problem is intensified in
the electronic domain. The quality of a reproduction is identical with the original, there is almost no cost
associated with making the copy, and with a single keystroke, hundreds of pages can be copied in a 
fraction of a second. In addition, electronic documents can be distributed to large groups, by electronic mail 
or network news services, with almost no effort on the part of the sender. 

The ability to easily and inexpensively copy and distribute electronic documents is considered to be the
main technical problem that must be overcome before electronic publishing can become a viable alternative
to conventional publishing. Preventing an individual from duplicating a file of data that is in his possession is 
an extremely difficult, if not impossible task. Instead of trying to prevent duplication of general data files, the 
present invention is directed to making electronic publishing more acceptable by making it possible to 
identify the original owner of a bitmap version of the text portion of a document. With the current copyright
laws, the present invention should be adequate to discourage much of the copying and distribution that
might otherwise occur. An interesting result of the present invention method is that a publisher or copyright 
owner can also determine who the original belonged to when reproduced copies are found. 

Summary of The Invention 

The present invention is directed to a method of deterring the illicit copying of electronically published 
documents. It includes utilizing a computer system to electronically publish a plurality of copies of a 
document having electronically created material thereon for distribution to a plurality of subscribers and 
operating programming within the computer system so as to perform the identification code functions. The
steps are to encode the plurality of copies each with a separate, unique identification code, the identification
code being based on a unique arrangement of the electronically created material on each such copy; and, 
creating a codebook to correlate each such identification code to a particular subscriber. In some 
embodiments, decoding methods are included with the encoding capabilities. The unique arrangement of 
the electronically created material may be based on line-shift coding, word-shift coding, or feature
enhancement coding (or combinations of these) and may be effected through bitmap alteration or document
format file alteration. 

Brief Description of The Drawings 

The present invention is more fully understood when the present specification is taken
in conjunction with the drawings appended hereto, wherein:

Figure 1 illustrates a flow diagram of an overview of preferred embodiments of the present invention
methods;

Figure 2 illustrates a flow diagram of an encoder operation in a present invention method;

Figure 3 illustrates a flow diagram of a decoder operation in a present invention method;

Figure 4 shows pseudocode for simple line spacing encoder operations for PostScript files;

Figure 5 shows a profile of a recovered document using text line-shift encoding;

Figure 6 illustrates three examples of feature enhancing in a 5 x 5 pixel array;

Figure 7 illustrates line-shift encoding with line space measurements shown qualitatively;

Figure 8 shows word-shift encoding with vertical lines to emphasize normal and shifted word spacing;

Figure 9 illustrates the same text as in Figure 8 but without vertical lines to demonstrate that both
unshifted and shifted word spacing appear natural to the untrained eye;

Figure 10 shows an example of text of a document with no feature enhancement;

Figure 11 shows the Figure 10 text with feature enhancement;

Figure 12 illustrates the Figure 11 text with the same features enhanced with exaggeration;

Figure 13 shows a comparison of baseline and centroid detection results as line spacing and font size
are varied;

Figure 14 shows a comparison of baseline and centroid detection as a text page is recursively copied.
The results are for 10 point font size with a single pixel spacing; and

Figure 15 shows a schematic diagram for a noise accumulation model.

Detailed Description of The Present Invention 

One object and general purpose of techniques of the present invention is to provide a means of 
discouraging the illegitimate copying and dissemination of documents. In the present invention methods, 
document marking embeds a unique identification code within each copy of a document to be distributed, 
and a codebook correlating the identification code to a particular subscriber (recipient) is maintained. 
Hence, examination of a recovered document (or in certain cases, a copy of a distributed document) reveals
the identity of the original document recipient. 

Document marking can be achieved by either altering text formatting, i.e. lines, words, or groups of 
characters, or by altering certain characteristics of textual elements (e.g. altering individual characters). The 
alterations used in marking a document in the present invention method enable the publisher to:
(1.) embed a codeword that can be identified for security (traceability) purposes, and
(2.) alter features with as little visible change of appearance as possible.

Certain types of markings can be detected in the presence of noise, which may be introduced in 
documents by printing, scanning, plain paper copying, etc. 

"Encoded" documents using the present invention methods, can provide security in several possible 
ways, including the following:

(1.) A document can be coded specifically for each site, subscriber, recipient, or user (hereinafter 
referred to as "subscriber"). Then, any dissemination of an encoded document outside of the intended 
subscriber may be traced back to the intended subscriber. 

(2.) A document code can mark a document as legitimately matched to a specific installation of a user 
interface (e.g. a particular subscriber computer workstation). If an attempt is made to display a document
unmatched to this interface, then that interface can be configured in such a way as to refuse display of 
the document. 

1.0 Overview of Applications 

An overview of document production, distribution, and user interaction according to the present
invention is illustrated in Figure 1. This shows three paths a document can follow from the publisher 3 to a
user. The first is the regular paper copy distribution channel 11 (i.e. a user receives a paper journal, etc.
from the publisher). The second and third paths are electronic dissemination 13 via document database 21
and electronic document interface 23, for user display 15 or through a user printer 17 to create a printed
document. Whether the copy comes from the paper copy distribution channel 11 or from the user printer 17,
a plain paper copier 27, for example, may then be used to create illicit paper copy 29. Variations could, of
course, be made to the flow chart of Figure 1 without exceeding the scope of the present invention. For
example, an illicit user could scan a legal version with a scanner and then electronically reproduce illicit
copies. The present invention methods apply to documents distributed along any of these or similar types of
distribution paths, e.g. published electronically and distributed via fax, via radio communication computer,
etc. Document coding is performed prior to document dissemination as indicated by encoder 9.

Documents are encoded while still in electronic form (Figure 2). The techniques to encode documents 
may be used in either of the two following forms: images or formatted document files. The image
representation describes each page (or sub-page) of a document as an array of pixels. The image may be
black and white (also called bitmap), gray-scale, or color. In the remainder of this text, the image 
representation is simply referred to as a "bitmap", regardless of the image color content. The formatted 
document file representation is a computer file describing the document content using such standard format 
description languages as PostScript, troff, SGML, etc. 

In a typical application, a bitmap is generated from a formatted document file. The coding technique(s)
used in the present invention to mark a document will depend in part on the original format supplied to the
encoder, and the format that the subscriber sees. It is assumed that once a subscriber sees a document
(e.g. displays a page on a workstation monitor), then he or she can capture and illegitimately disseminate
that document. Therefore, coding must be embedded before this subscriber stage. Thus, as in Figure 2, the
electronic document 31 is encoded at encoder 33 according to a preselected set of alterations set up in
codebook 35. (It is not essential that the codebook predate the encoding; in some embodiments, the
codebook may be created from logging of identification codes as used, to correlate these to specific
subscribers, or vice versa.) The encoded documents are each uniquely created as version 1 (37), version 2
(39), version 3 (41), version 4 (43), and so on through version N (45).

A variety of encoding techniques may be used in the present invention methods. These relate to
altering lines, words or character features (or combinations) without the need to add textual, graphical,
alphabetical, numerical or other unique identifiers, and thereby do not alert an illicit copier to the code. Thus,
common to all methods is that the codeword is embedded in the document by altering particular aspects of
already existing features. For instance, consider the codeword 1101 (binary). Reading this code right to left
from the least significant bit, the first document feature is altered for bit 1, the second feature is not altered
for bit 0, and the next two features are altered for the two 1 bits. It is the type of feature that distinguishes
each particular encoding method:

(1.) Line-Shift Coding: a method of altering the document format file by shifting the locations of text-lines
to uniquely encode the document. This code may be decoded from the format file or bitmap. Lines may
be dithered horizontally or vertically, for example. The method provides the highest reliability among
these methods for detection of the code even in images degraded by noise.

(2.) Feature-Enhancement Coding: a method of altering a document bitmap image by modifying certain
textual element features to uniquely encode the document. One example of such a modification is to
extend the length of character ascenders. Another is to narrow character width; another is to remove or
shorten a character section. This type of code is encoded and decoded from the bitmap image.

(3.) Word-Shift Coding: a method of altering the document format file or image bitmap by shifting the
locations of words within the text to uniquely encode the document. This coding may be decoded from
the format file or bitmap. This method, in preferred embodiments using document format file alteration, is
similar in use to method (1). It typically provides less visible alteration of the document than method (1),
but decoding from a noisy image may be less easily performed.
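
The least-significant-bit-first reading of the codeword described above can be illustrated with a short sketch. It is only an illustration: the function name `alteration_schedule` and the fixed number of features are assumptions made here, not part of the specification.

```python
def alteration_schedule(codeword: int, n_features: int) -> list[bool]:
    """Return, for each feature in document order, whether it is altered.

    Bits are consumed from the least significant end, so feature 0 takes
    bit 0 of the codeword, feature 1 takes bit 1, and so on.
    """
    return [bool((codeword >> k) & 1) for k in range(n_features)]


# Codeword 1101 (binary): alter, leave, alter, alter.
print(alteration_schedule(0b1101, 4))
```

The same schedule applies whichever feature type (line, word spacing, or character feature) a given encoding method alters.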

A detailed discussion will now follow regarding each of the above three encoding techniques.

2.1 Text-Line Coding

This is a coding method that is applied to a formatted document file. In the following discussion, it is
assumed that the formatted document file is in Adobe Systems Incorporated's PostScript, the most common
page description language format used today. However, the present invention is also applicable to other
document file formats. PostScript describes the document content a page at a time. Simply
put, it specifies the content of a text-line (or text-line fragment such as a phrase, word, or character) and
identifies the location for the text to be displayed. Text location is marked with an x-y coordinate
representing a position on a virtual page. Depending on the resolution used by the software generating
PostScript, the location of the text can be modified by as little as 1/720 inch (1/10 of a printer's "point").
Most laser printers in common use today have somewhat less resolution (e.g. 1/300 inch).

In one embodiment of the present invention method, prior to distribution, the original PostScript 
document and the codeword are supplied to an encoder. The encoder reads the codeword, and searches 
for the lines which are to be moved. Upon finding a line to be moved, the encoder modifies the original 
(unspaced) PostScript file to incorporate the line spacing adjustments. This is done by increasing or
decreasing the y coordinate of the line to be spaced. The encoder output is an "encoded" PostScript
document ready for distribution in either electronic or paper form to a subscriber. 
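
A minimal sketch of such an encoder is given below. It does not parse PostScript; it works on a list of text-line y coordinates that a real encoder would extract from the file. Several things are assumptions made for illustration: the function name `encode_line_shifts`, the choice of every second line as a candidate for movement (borrowed from the differential scheme used in the experiments), the mapping of bit 1 to a downward shift and bit 0 to an upward shift, and the use of integer units of 1/300 inch with y increasing upward.

```python
def encode_line_shifts(baselines, codeword, delta=1):
    """Shift every second text line up or down according to the codeword.

    baselines: y coordinates of the text lines, top line first, with y
               increasing upward (so moving a line down decreases y).
    codeword:  non-negative integer, read least significant bit first.
    delta:     shift magnitude in the same units as the baselines.
    """
    shifted = list(baselines)
    remaining = codeword
    # Lines at indices 1, 3, 5, ... are moved; their neighbours stay fixed.
    for i in range(1, len(baselines) - 1, 2):
        bit = remaining & 1
        remaining >>= 1
        # Assumed convention: bit 1 moves the line down, bit 0 moves it up.
        shifted[i] += -delta if bit else delta
    return shifted


# Five lines spaced 12 units apart; codeword 01 (binary).
print(encode_line_shifts([700, 688, 676, 664, 652], 0b01))
```

Because the neighbours of each moved line stay fixed, a downward shift widens the gap above the line and narrows the gap below it by the same amount, which is what a decoder later measures.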

Figure 3 illustrates how a publisher may identify the original recipient (subscriber) of a marked 
document by analysis of a recovered paper copy of the document. That is, given a questionable hard copy 
51, copy 51 is scanned by scanner 53, analyzed by computer 55, decoded with decoder program 57, and
matched to codebook 59, to determine the source or subscriber version 61. For example, the "decoder"
analyzes the line spacing, and extracts the corresponding codeword, uniquely identifying the original 
subscriber. 

A page (or pages) of the illicit copy of the document may be electronically scanned to produce a
bitmap image of the page. The bitmap image may preferably be subjected to noise reduction to remove
certain types of extraneous markings (e.g. noise introduced by printing a hard copy, plain paper copying,
electronic scanning, smudges, etc.). The bitmap image may then be rotated to ensure that the text lines are
perpendicular to the side page edge. A "profile" of the page is found: this is the number of ON bits on each
horizontal scan line in the image. The number of such scan lines varies, but in our experiment, the number
of scan lines is around 40 per text-line. The distance between each pair of adjacent text-line profiles may
then be measured. This is done by one of two approaches: either the distance between the baselines of
adjacent line profiles is measured, or the difference between centroids (i.e. centers of mass) of adjacent line
profiles is measured. The interline spacings are then analyzed to determine if spacing has been added or
subtracted. This process, repeated for every line, determines the codeword of the document, which uniquely
determines the original subscriber.

Advantages of this method relative to the other present invention methods are as follows:
The code can be decoded without the original;
Decoding is quite simple;
It is likely to be the most noise resistant technique.

However, this method is likely to be the most visible of the coding techniques described herein.

Figure 4 illustrates a simple line spacing encoder pseudocode for PostScript files. 

Figure 5 shows a graph of a line spacing profile of a recovered document page. The scan line of the 
baseline of each text-line is marked with a " + ". The scan line of the centroid of each text-line is marked 
with a dot. Decoding a page with line spacing may involve measuring the distance between adjacent text-line
centroids or baselines and determining whether space has been increased, decreased, or left the same
as the standard.

2.2 Feature Enhancement Coding 

This is a present invention coding method that is applied directly to the bitmap image of the document.
The bitmap image is examined for chosen features, and those features are altered, or not altered,
depending on the codeword. These alterations may be widening, narrowing, slanting, subtracting from, or
adding to the features of the individual characters. For example, upward, vertical endlines of letters (that is,
the tops of letters such as b, d, h, etc.) may be extended. These endlines are altered by extending their
lengths by one (or more) pixels, but are not otherwise changed.

This coding is applied upon the bitmap image, and can be detected upon the printer image. With more 
emphasized coding than suggested below, or with redundancy in the coding, it may also be detected in 
scanned images of printed and photocopied documents. 

Advantages of this present invention method are as follows:

There are a very large number of code possibilities (perhaps 10 times more than for word-shift coding
and 20 times more than for line-shift coding);

The coding is performed on the bitmap of the document, thus there is no need for altering the formatted
document file;

This is one of the least visible methods of coding the image.

Disadvantages include:

This is primarily an image coding technique and is not normally applicable to the format file (the other
techniques are more readily applicable to both);

This method may be less applicable to photocopied, or otherwise noisy, documents, because the coding
becomes visible when applied with such magnitude (length) as to be noise-tolerant;

This code cannot be detected without the original.

The pseudocode for coding and decoding may be as follows, with reference to Figure 6, which shows
three examples of normal coding 63, 65 and 67, and enhanced coding 73, 75 and 77 of a 5 x 5 pixel array:

CODING:

mask off the least significant codeword bit and right-shift the codeword
for each pixel in image in chosen order (e.g. raster-scan order)
{
    examine k x k (e.g. 5 x 5) neighborhood of pixels around this center pixel
    if the pattern of pixels within the k x k mask matches one of the chosen features as in Figure 6
    then
    {
        if codeword bit is 1, alter feature as in Figure 6
        else if codeword bit is 0, leave feature as is
        store (x, y) location of center pixel and 1 or 0 value of codeword bit
        if codeword = 0, break
        else mask next codeword bit and right-shift
    }
}

DECODING:

read in list of codeword bits and corresponding center pixel locations
    where coding has been performed on original image
set codedImage = 1
for each (x, y) location of coded feature
{
    examine k x k neighborhood of pixels around center pixel location
    if k x k region matches altered pattern and codeword bit is 0
    or if pixels do not match altered pattern and codeword bit is 1
    then codedImage = 0, break
}
if codedImage = 1, then image matches code
if codedImage = 0, then image does not match code
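
The coding and decoding loops above can be sketched in runnable form. The sketch below is an illustration under stated assumptions, not the method of Figure 6: the feature detector `is_stroke_top` is a hypothetical stand-in that recognizes only the top end of a one-pixel-wide vertical stroke, the alteration is a one-pixel upward extension, and the bitmap is a plain list of rows of 0/1 values.

```python
def is_stroke_top(img, y, x):
    """Hypothetical feature test: top end of a one-pixel-wide vertical stroke."""
    h, w = len(img), len(img[0])
    if y < 1 or y + 2 >= h or x < 1 or x + 1 >= w:
        return False
    return (img[y][x] == 1 and img[y - 1][x] == 0
            and img[y][x - 1] == 0 and img[y][x + 1] == 0
            and img[y + 1][x] == 1 and img[y + 2][x] == 1)


def encode_features(img, codeword):
    """Raster-scan the bitmap, consuming one codeword bit per matched feature.

    Returns the coded bitmap and a log of (x, y, bit) entries for the decoder.
    """
    out = [row[:] for row in img]
    log = []
    remaining = codeword
    for y in range(len(img)):
        for x in range(len(img[0])):
            # Features are detected on the unaltered image.
            if not is_stroke_top(img, y, x):
                continue
            bit = remaining & 1
            remaining >>= 1
            if bit:
                out[y - 1][x] = 1  # extend the stroke upward by one pixel
            log.append((x, y, bit))
            if remaining == 0:
                return out, log
    return out, log


def matches_code(img, log):
    """Check every logged feature location against its recorded bit."""
    return all(img[y - 1][x] == bit for x, y, bit in log)


# Three vertical strokes on a 6 x 7 bitmap, coded with 101 (binary).
page = [[0] * 7 for _ in range(6)]
for col in (1, 3, 5):
    for row in range(2, 6):
        page[row][col] = 1
coded, log = encode_features(page, 0b101)
print(log, matches_code(coded, log), matches_code(page, log))
```

As in the pseudocode, the decoder depends on the stored list of feature locations, which is why this code cannot be detected without information derived from the original.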



2.3 Word-Shift Coding 

This is a coding method that is applicable to documents with variable spacing between adjacent words. 
This encoding is most easily applied to the format file. For each text-line, the largest and smallest spacings 
between words are found. To code a line, the largest spacing is decremented by some amount and the 
smallest is augmented by the same amount. This maintains the overall text-line length, and produces little 
qualitative change on the text image. 

Advantages of this method relative to other present invention methods are as follows: 
It is one of the least visible methods of coding the image.

EP 0 660 275 A2 



Disadvantages include:

The code cannot be decoded without the original.

The pseudocode is as below:



CODING:

mask off the least significant codeword bit and right-shift the codeword
for each text-line in the format file
{
    if the code bit is 1
    {
        find the longest space between words
        find the shortest space between words
        shorten longest space and lengthen shortest space by a chosen amount
            (must be <= longest space - shortest space)
    }
    store text-line number, altered space positions, and code bit
    if codeword = 0, break
    else mask next codeword bit and right-shift
}

DECODING:

read in list of codeword bits and corresponding line numbers and space locations in text-lines
    where coding has been performed on original image
set codedImage = 1
for each text-line in coded formatted file and original formatted file
{
    if codeword bit for a text-line is 1
    then
        if coded spaces in coded image are not different from corresponding spaces in original image
        then codedImage = 0, break
}
if codedImage = 1, then image matches code
else if codedImage = 0, then image does not match code
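
The word-shift procedure can be sketched as follows. This is an illustration only: each text line is represented as a list of inter-word space widths (a simplification of what would be read from a format file), and the names `shift_spaces`, `encode_word_shifts` and `decode_word_shifts` are assumptions introduced here. The decoder shown recovers the bits by comparison with the original rather than verifying a known codeword, which is a slight variation on the pseudocode.

```python
def shift_spaces(spaces, delta):
    """Shorten the longest space and lengthen the shortest by delta."""
    longest = max(range(len(spaces)), key=lambda i: spaces[i])
    shortest = min(range(len(spaces)), key=lambda i: spaces[i])
    if delta > spaces[longest] - spaces[shortest]:
        raise ValueError("delta exceeds longest space - shortest space")
    out = list(spaces)
    out[longest] -= delta
    out[shortest] += delta
    return out


def encode_word_shifts(lines, codeword, delta=1):
    """Apply one codeword bit per text line, least significant bit first."""
    coded = [list(spaces) for spaces in lines]
    remaining = codeword
    for n, spaces in enumerate(lines):
        bit = remaining & 1
        remaining >>= 1
        if bit:
            coded[n] = shift_spaces(spaces, delta)
        if remaining == 0:
            break
    return coded


def decode_word_shifts(coded, original):
    """Recover the codeword by comparing coded lines with the original."""
    bits = [int(c != o) for c, o in zip(coded, original)]
    return sum(b << k for k, b in enumerate(bits))


lines = [[10, 12, 9], [8, 8, 11], [10, 10, 14]]
coded = encode_word_shifts(lines, 0b101)
print(coded, decode_word_shifts(coded, lines))
```

Because the same amount is removed from one space and added to another, the total length of each text line is unchanged, as the description requires.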



2.4 Illustrative Review of Altering Techniques 

Figure 7 illustrates an example of line-shift encoding. Note that the second line 83 is shifted down from
first line 81 by approximately 1/150 inch, which equals delta. Due to differential coding, this causes the
spacing between the first line 81 and the second line 83 to be greater than normal and the spacing between
second line 83 and third line 85 to be less than normal. The spaces between third line 85, fourth line 87 and
fifth line 89 are normal.

Figures 8 and 9 illustrate word-shift encoding. Figure 8 shows vertical lines 91, 93, 95, 97, 99, 101, 103, 
105 and 107. These lines are vertical guidelines to show the positioning of each of the words in the top and
bottom lines of Figure 8. The word "for" has intentionally been shifted and, therefore, rests at vertical line
99 in the bottom line of text and against vertical line 101 in the top line of text. Figure 9 shows the same 
text as in Figure 8, but without the vertical lines to demonstrate that both unshifted and shifted word spacing 
appears natural to the untrained eye. 

Figures 10, 11 and 12 illustrate feature enhancement encoding. Figure 10 shows characters of text
which have not been altered and would typically represent the standard or reference document. Figure 11
shows the same document but with feature enhancement added. Note, for example, that "1" has been 
vertically extended as have the letters "t", "I", and "d" in the first line as well as other characters 
elsewhere. Figure 12 shows the same feature enhancements as in Figure 11, but with exaggeration to 
simply emphasize the enhancement. Based on Figures 11 and 12, an appropriate codeword for the
enhanced feature documents would be 5435 decimal.

3.0 Application of Error Correction 

Due to noise which may be introduced in the recovered document, the identification process is subject
to error. Careful choice of the set of codewords used to space lines (based on Error Correcting Codes) will
be used to minimize the chance of detection error. This will establish a tradeoff between the number of
potential recipients of a document (i.e. the number of codewords) and the probability of correct
identification. To illustrate, the following discussion gives detail of how line-shift decoding may preferably be
enhanced by noise removal.
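
The tradeoff between the number of codewords and the probability of correct identification can be made concrete with a small sketch of nearest-codeword decoding. The specification does not name a particular error correcting code; the four 5-bit codewords, the subscriber labels, and the function names below are hypothetical and chosen only so that any two codewords differ in at least three bit positions.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two codewords differ."""
    return bin(a ^ b).count("1")


def nearest_subscriber(received: int, codebook: dict[int, str]) -> str:
    """Map a possibly corrupted codeword to the closest codebook entry."""
    best = min(codebook, key=lambda cw: hamming(cw, received))
    return codebook[best]


# Hypothetical codebook with minimum pairwise distance 3.
codebook = {
    0b00000: "subscriber A",
    0b00111: "subscriber B",
    0b11001: "subscriber C",
    0b11110: "subscriber D",
}

# One line shift misread (00111 received as 00101) is still resolved.
print(nearest_subscriber(0b00101, codebook))
```

With five shifted lines, 32 codewords are possible if no protection is wanted; restricting the codebook to four widely separated codewords lets a single misdetected line be tolerated, at the cost of serving fewer recipients.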
In general, in the present invention methods, a line-shift decoder extracts a codeword from a (possibly
degraded) bitmap representation of an encoded document (decoding a recovered, unmodified formatted
document file is trivial). An illicit copy of an encoded document may be recovered in either electronic or
paper form. If paper is recovered, a page (or pages) of the document is electronically scanned producing a
bitmap image of the page(s). Extracting the code from an image file is not as straightforward as doing so
from the format file. Since the image contains ON and OFF bits, rather than ASCII text and formatting
commands, pattern recognition techniques must be used first to determine the content. Furthermore, since
noise may be present, image processing techniques are performed to reduce noise and make the job of
pattern recognition more robust. Some of the techniques used for document decoding from the image are
as follows:

Salt-and-Pepper Noise Removal: Inking irregularities, copier noise, or just dirt on the paper can
cause an image to contain black specks in background areas, and white specks within foreground areas
such as text. Since this noise interferes with subsequent processing, it is desirable to reduce it as much as
possible.

A kFill filter is used, which is designed to reduce salt-and-pepper noise while maintaining document
quality. It does so by discriminating noise from true text features (such as periods and dots) and removing
the noise. It is a conservative filter, erring on the side of maintaining text features versus reducing noise
when those two conflict. It has been described for document clarification and is known by the artisan.

Deskewing: Each time that paper documents are photocopied and scanned, the orientation of the text
lines on the page may be changed from horizontal because of misorientation (skewing) of the page. In
addition, a photocopier may also introduce skewing due to the slight non-linearity of its optics. The success
of subsequent processing requires that this skew angle be corrected, that is, that the text lines be returned
to the horizontal in the image file.

One approach to deskewing uses the document spectrum, or docstrum, technique, a bottom-up
segmentation procedure that begins by grouping characters into words, then words into text lines. The
average angle of the text lines is measured for a page, and if this is non-zero (not horizontal), then the
image is rotated to zero skew angle. Rotation, followed by bilinear interpolation to achieve the final
deskewed image, is a standard digital image processing procedure that can be found in the published
literature.

Text-Line Location: After deskewing, the locations of the text lines can be found. A standard
document processing technique called the projection profile is used. This is simply a summation of the
ON-valued pixels along each row. For a document whose text lines span horizontally, this profile will have
peaks whose widths are equal to the character height and valleys whose widths are equal to the white
space between adjacent text lines. The distances between profile peaks determine interline spacing.

In one preferred embodiment, the present invention line-shift decoder measures the distance between
each pair of adjacent individual text line profiles (within the page profile). This is done by one of two
approaches: either by measuring the distance between the baselines of adjacent line profiles, or by
measuring the difference between centroids of adjacent line profiles, as mentioned above. A baseline is the
logical horizontal line on which characters sit; a centroid is the center of mass of a text line. As seen in
Figure 5, discussed above, each text line produces a distinctive profile with two peaks, corresponding to the
midline and the baseline. The peak in the profile nearest the bottom of each text line is taken to be the
baseline; if equal peak values occur on neighboring scan lines, the largest value scan line is chosen as the
baseline scan line. To define the centroid of a text line precisely, suppose the text line profile runs from
scan line $y$, $y+1$, ..., to $y+w$, and the respective numbers of ON bits per scan line are $h(y)$, $h(y+1)$, ...,
$h(y+w)$. Then the text line centroid is given by

$$\frac{y\,h(y) + (y+1)\,h(y+1) + \cdots + (y+w)\,h(y+w)}{h(y) + h(y+1) + \cdots + h(y+w)}$$
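
The projection profile and the centroid formula can be sketched together as follows. The sketch assumes a clean, deskewed bitmap given as a list of rows of 0/1 values, and it assumes that text lines are separated by at least one completely empty scan line; the function names are introduced here for illustration. Baseline detection is omitted.

```python
def projection_profile(bitmap):
    """Number of ON pixels on each horizontal scan line."""
    return [sum(row) for row in bitmap]


def text_line_bands(profile):
    """Return (first, last) scan-line indices of each run of non-empty rows."""
    bands, start = [], None
    for y, h in enumerate(profile):
        if h > 0 and start is None:
            start = y
        elif h == 0 and start is not None:
            bands.append((start, y - 1))
            start = None
    if start is not None:
        bands.append((start, len(profile) - 1))
    return bands


def centroid(profile, first, last):
    """Center of mass of the profile over scan lines first..last inclusive."""
    rows = range(first, last + 1)
    return sum(y * profile[y] for y in rows) / sum(profile[y] for y in rows)


# Two tiny "text lines" separated by two empty scan lines.
bitmap = [
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
]
profile = projection_profile(bitmap)
bands = text_line_bands(profile)
centroids = [centroid(profile, a, b) for a, b in bands]
print(profile, bands, centroids[1] - centroids[0])
```

The difference between consecutive centroids is the interline spacing that the decision rules then compare against the expected spacing.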

The measured interline spacings (i.e. between adjacent centroids or baselines) are used to determine if
white space has been added or subtracted because of a text line shift. This process, repeated for every
line, determines the codeword of the document, which uniquely determines the original recipient.

The decision rules for detection of line shifting in a page with differential encoding are as follows.
Suppose text lines $i-1$ and $i+1$ are not shifted and text line $i$ is either shifted up or down. In the
unspaced document, the distances between adjacent baselines, or baseline spacings, are the same. Let
$s_{i-1}$ and $s_i$ be the distances between baselines $i-1$ and $i$, and between baselines $i$ and $i+1$,
respectively. Then the decision rule is:

if $s_{i-1} > s_i$ : decide line $i$ shifted down
if $s_{i-1} < s_i$ : decide line $i$ shifted up
otherwise : uncertain

Baseline Detection Decision Rule (3.2)

Unlike baseline spacings, centroid spacings between adjacent text lines in the original unspaced
document are not necessarily uniformly spaced. In centroid-based detection, the decision is based on the
difference of centroid spacings in the spaced and unspaced documents. More specifically, let $s_{i-1}$ and $s_i$
be the centroid spacings between lines $i-1$ and $i$, and between lines $i$ and $i+1$, respectively, in the
spaced document; let $t_{i-1}$ and $t_i$ be the corresponding centroid spacings in the unspaced document. Then
the decision rule is:

if $s_{i-1} - t_{i-1} > s_i - t_i$ : decide line $i$ shifted down
otherwise : decide line $i$ shifted up

Centroid Detection Decision Rule (3.3)
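
The two decision rules can be written directly as functions. The names `baseline_decision` and `centroid_decision`, and the string results, are choices made for this sketch; the arguments follow the symbols used above, with `s_prev` and `s_next` standing for the spacings above and below line i in the recovered document, and `t_prev` and `t_next` for the same spacings in the unspaced document.

```python
def baseline_decision(s_prev: float, s_next: float) -> str:
    """Baseline rule (3.2): compare the spacing above and below line i."""
    if s_prev > s_next:
        return "down"
    if s_prev < s_next:
        return "up"
    return "uncertain"


def centroid_decision(s_prev: float, s_next: float,
                      t_prev: float, t_next: float) -> str:
    """Centroid rule (3.3): compare changes relative to the unspaced document."""
    return "down" if (s_prev - t_prev) > (s_next - t_next) else "up"


# A line moved down by one pixel from a uniform 12-pixel baseline spacing.
print(baseline_decision(13, 11))
# Centroid spacings need the unspaced reference values 12.1 and 12.0.
print(centroid_decision(13.2, 11.1, 12.1, 12.0))
```

The baseline rule needs no reference document because unspaced baseline spacings are uniform, whereas the centroid rule always returns a decision but requires the unspaced centroid spacings.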

An error is said to occur if the decoder decides that a text line was moved up (down) when it was 
moved down (up). In baseline detection, a second type of error exists. The decoder is uncertain if it cannot
determine whether a line was moved up or down. Since in the encoding every other line is moved, and this 
information is known to the decoder, false alarms do not occur. 

4.0 Experimental Results 

Two sets of experiments were performed. The first set was designed to test how well line-shift coding
works with different font sizes and different line spacing shifts in the presence of limited, but typical image
noise. The second set was designed to discover how well a fixed line spacing shift could be detected
as document degradation became increasingly severe. The equipment used in both experiments was as
follows:

1. Ricoh FS1S 400 dpi Flat Bed Electronic Scanner 

2. Apple LaserWriter llntx 300 dpi laser printer 

3. Xerox 5052 plain paper copier. 

9 
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The printer and copier were selected in part because they are typical of the equipment found in wide 
use in office environments. The particular machines used could be characterized as being heavily used but 
well maintained. Xerox and 5052 are trademarks of Xerox Corp. Apple and LaserWriter are trademarks of 
Apple Computer, Inc. Ricoh and FSI are trademarks of Ricoh Corp. 

4.1 Variable Font Size Experiment 

Each experiment in the first set uses a single-spaced page of text in the Times-Roman font. The page
is coded using the differential encoding scheme. In differential encoding, every other line of text in each
paragraph was kept unmoved, starting with the first line of each paragraph. Each line between two unmoved
lines was always moved either up or down. That is, for each paragraph, the 1st, 3rd, 5th, etc. lines were
unmoved, while the 2nd, 4th, etc. lines were moved. Nine experiments were performed using font sizes of
8, 10 or 12 points and shifting alternate lines (within each paragraph) up or down by 1, 2, or 3 pixels. Since
the printer has a 300 dpi resolution, each pixel corresponds to 1/300 inch, or approximately one-quarter
point. Each coded page was printed on the laser printer, then copied three times. The laser printed page
will be referred to as the 0th copy; the nth copy, n ≥ 1, is produced by copying the (n-1)st copy. The third
copy was then decoded to extract the codeword. That is, the third copy was electronically scanned, the
bitmap image processed to generate the profile, the profile processed to generate the text line spacings
(both baseline and centroid spacings), and the codeword detected using these measurements and rules
(3.2) and (3.3).
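
As a rough illustration of differential encoding within one paragraph, the sketch below maps codeword bits to per-line vertical shifts. The function names and the bit convention (1 means shift down, 0 means shift up) are assumptions made for this example; the text only specifies that odd-numbered lines stay put and even-numbered lines move up or down.

```python
def differential_shifts(bits, shift_px=1):
    """Map codeword bits to vertical shifts for the lines of one paragraph.

    Lines 1, 3, 5, ... are unmoved; lines 2, 4, ... each carry one bit.
    Assumed convention: bit 1 -> shift down (+), bit 0 -> shift up (-).
    """
    shifts = []
    for bit in bits:
        shifts.append(0)                                  # unmoved reference line
        shifts.append(shift_px if bit else -shift_px)     # coded line
    shifts.append(0)                                      # closing unmoved line
    return shifts


def shifted_baselines(baselines, shifts):
    """Apply the shifts to baseline positions (scan lines from the page top)."""
    return [y + d for y, d in zip(baselines, shifts)]


baselines = list(range(0, 84, 12))        # 7 lines, uniform 12-pixel spacing
shifts = differential_shifts([1, 0, 1])   # [0, 1, 0, -1, 0, 1, 0]
print(shifted_baselines(baselines, shifts))   # [0, 13, 24, 35, 48, 61, 72]
```

A paragraph of 2k+1 lines carries k bits under this scheme, since every coded line must sit between two unmoved lines.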

Figure 13 presents the results of the variable font size experiment for one page of single-spaced text. 
Note that as the font size decreases, more lines can be placed on the page, permitting more information to 
be encoded. Both baseline and centroid approaches detected without error for spacings of at least 2 pixels; 
the centroid approach also had no errors for a 1 pixel spacing. 

Though it is not shown in Figure 13, it is noteworthy that some variability will occur in the detection
performance results, even in repeated "decoding" of the same recovered page. This variability is due in
part to randomness introduced in electronic scanning. If a page is scanned several times, different skew
angles will ordinarily occur. The skew will be corrected slightly differently in each case, causing detection
results to vary.

To illustrate this phenomenon, the test case (8 point text, 1 pixel spacing) was rescanned 3 additional
times. The initial text line skew angle (i.e. before deskewing) differed for each scan. In the three rescans,
the following decoding results were observed under baseline detection: 5 uncertain, 3 uncertain and 1 error,
and 6 uncertain. Curiously, the line spacings that could not be detected or were in error varied somewhat
across the retries. This suggests that some decoding performance may be gained by scanning a
single page multiple times and combining the results (e.g. averaging).
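
One way such combining might look is sketched below: the spacings measured on several scans of the same page are averaged before the baseline rule is applied. The spacing values are made up for illustration, the function names are hypothetical, and whether averaging actually helps on real scans is not established by the experiments reported here.

```python
from statistics import mean


def average_spacings(scans):
    """Average each line spacing across several scans of the same page."""
    return [mean(column) for column in zip(*scans)]


def decode_baseline(spacings):
    """Apply the baseline rule to the spacing pair around each coded line."""
    decisions = []
    for k in range(0, len(spacings) - 1, 2):
        s_prev, s_next = spacings[k], spacings[k + 1]
        if s_prev > s_next:
            decisions.append("down")
        elif s_prev < s_next:
            decisions.append("up")
        else:
            decisions.append("uncertain")
    return decisions


# Hypothetical baseline spacings from three scans of a 5-line paragraph.
scans = [[13, 11, 12, 12], [12, 12, 11, 13], [13, 11, 11, 13]]
print(decode_baseline(scans[0]))                   # ['down', 'uncertain']
print(decode_baseline(average_spacings(scans)))    # ['down', 'up']
```

Averaging also turns the integer-valued baseline spacings into real values, which is one reason ties become less likely.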

4.2 Plain Paper Copying Experiment 

For the second set of experiments, a single-spaced page of text was coded using differential encoding.
The font was fixed to be Times-Roman, the font size to be 10 point, and the coding line-shift to be 1 pixel.
Repeated copies (the 0th, 1st, ..., 10th copy) of the page were then made, and each copy used in a separate
experiment. Hence, each successive experiment used a slightly more degraded version of the same text
page. The experimental results are tabulated in Figure 14.

No errors were observed through the 10th recursive copy using centroid detection. What is even more
remarkable is that less than half of the available signal-to-noise "margin" has been exhausted by the 10th
copy. This suggests that many more copies would likely be required to produce even a single error; such a
document would be illegible!

Figure 14 shows that, for baseline decoding, detection errors and uncertainties do not increase
monotonically with the number of copies. Further, the line spacings that could not be detected correctly
varied somewhat from copy to copy. This suggests that line spacing "information" is still present in the text
baselines, and can perhaps be made available with some additional processing.

The results of Figure 14 report the uncoded error performance of our marking scheme. But the 21 line
shifts used in the experiment were not chosen arbitrarily. The codeword comprised 3 concatenated
codewords selected from a Hamming block code, a 1-error correcting code. Hence, roughly each 1/3 page
was protected from 1 error. Many, but not all, of the errors and uncertainties resulting from baseline
decoding would have been corrected by this encoding. However, since uncoded centroid detection
performed so well, it is unclear whether there is any need to supplement it with error correction.
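
For readers unfamiliar with Hamming codes, the following sketch shows single-error correction with the (7,4) Hamming code. The choice of the (7,4) code and the bit ordering are assumptions: the text does not state the code parameters, although 21 line shifts split into 3 codewords is consistent with a block length of 7.

```python
def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(r):
    """Correct at most one flipped bit and return the 4 data bits."""
    r = list(r)
    s1 = r[0] ^ r[2] ^ r[4] ^ r[6]   # parity over positions 1, 3, 5, 7
    s2 = r[1] ^ r[2] ^ r[5] ^ r[6]   # parity over positions 2, 3, 6, 7
    s3 = r[3] ^ r[4] ^ r[5] ^ r[6]   # parity over positions 4, 5, 6, 7
    pos = s1 + 2 * s2 + 4 * s3       # 1-based error position, 0 if no error
    if pos:
        r[pos - 1] ^= 1
    return [r[2], r[4], r[5], r[6]]


def encode_page(data_bits):
    """Concatenate three codewords: 12 data bits -> 21 line-shift bits."""
    page = []
    for k in range(0, 12, 4):
        page += hamming74_encode(data_bits[k:k + 4])
    return page


codeword = hamming74_encode([1, 0, 1, 1])
received = list(codeword)
received[4] ^= 1                       # one line decoded in the wrong direction
print(hamming74_correct(received))     # [1, 0, 1, 1]
```

An "uncertain" decision from baseline detection is not handled here; a practical decoder would need a policy for it, for example guessing a direction or treating it as an erasure.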

5.0 Discussion and Implications of Image Defects

Image defects resulting from plain paper copying are all too familiar to the reader. The defects most
significantly affecting the detection results are now briefly discussed. The discussion is largely qualitative; a
more quantitative discussion of image defects and their physical underpinnings is beyond the scope of this
overview.

The primary troublesome defect we encountered was text line skew, or the rotation of text lines about a
point. In most experiments we observed skew angles in the range [-3°, +3°]. Text line skew was largely
removed by image rotation, albeit at the expense of introducing some distortion.

Blurring also increased with the number of copies produced, indeed ultimately making the 10th copy
barely legible. Blurring seemed to have surprisingly minor implications for detection performance. Plain
paper copies were produced at the copier's nominal "copy darkness" setting; blurring typically increases
with copy darkness. As the number of copies increased, darkness generally varied over a page; regions of
severe fading were sometimes observed. It is unclear whether blurring or fading is more detrimental to
decoding performance.

Expansion or shrinking of copy size is another potential problem. It is not unusual to discover a 4%
page length or width change after 10 copies. Further, expansion along the length and width of a page can
be markedly different. Copy size changes forced us to use differential encoding, that is, encoding
information in the relative rather than absolute shifts between adjacent text lines.
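
A small numerical example may make the point concrete. The sketch below compares a hypothetical absolute test (is the line above or below its nominal position?) with the differential test on a page that has been expanded by 4%. The line positions, the function names, and the absolute test itself are illustrative assumptions, not part of the described method.

```python
def decide_absolute(y, y_nominal):
    """Hypothetical absolute test: compare a line position with its nominal value."""
    if y > y_nominal:
        return "down"
    if y < y_nominal:
        return "up"
    return "uncertain"


def decide_differential(y_prev, y, y_next):
    """Differential test: compare the spacings above and below the coded line."""
    s_prev, s_next = y - y_prev, y_next - y
    if s_prev > s_next:
        return "down"
    if s_prev < s_next:
        return "up"
    return "uncertain"


# Line i nominally at scan line 600, moved up 1 pixel; neighbors at 588 and 612.
coded = [588.0, 599.0, 612.0]
copied = [1.04 * y for y in coded]            # 4% vertical expansion from copying

print(decide_absolute(copied[1], 600.0))      # down (wrong: expansion swamps the shift)
print(decide_differential(*copied))           # up (correct)
```

Uniform scaling multiplies both adjacent spacings by the same positive factor, so their ordering, and hence the differential decision, is unchanged.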

Simple inspection of the copies shows both a wide range of horizontal and vertical displacements and
other image defects (e.g. salt-and-pepper noise) of little consequence. Perhaps the most startling
degradation is "baseline waviness" (i.e. nonconstant skew across a text line). It is remarkable that detection
is not dramatically affected by this particular image degradation.

5.1 An Analytical Noise Model

In this subsection, a simple model of the noise affecting text line centroids is presented. There are two
types of noise. The first type of noise models the distortions in printing the document; the second type
models the distortions in copying. This second type of noise increases with the number of copies while the
first type does not. The accumulation of noise is illustrated in Figure 15. This illustrates the theoretical
model as a document travels from encoder 201, to the original printer 203, first copier 205, second copier
207, the last of a series of copiers, i.e. copier K 209, and then decoder 211.

A page of text with $n+1$ text lines yields $n+1$ vertical coordinates $y_1, \ldots, y_{n+1}$ that represent the
centroids of the text lines, measured from, say, the top page margin. The centroid spacings, or distances in
scan lines between adjacent centroids, are given by

$t_i = y_{i+1} - y_i, \quad i = 1, \ldots, n.$

Hence, for detecting line-shifts, a page of $n+1$ text lines is effectively described by $n$ centroid spacings.
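
The computation of centroid spacings can be sketched as follows. The sketch assumes that the profile of a text line is the count of black pixels on each of its scan lines and that the centroid is the profile-weighted mean scan-line index; the function names and the toy profiles are hypothetical.

```python
def centroid(profile, top):
    """Centroid of one text line, in scan lines from the top of the page.

    profile: black-pixel counts on consecutive scan lines of the text line
             (assumed definition of the profile)
    top:     scan-line index of the first entry in profile
    """
    total = sum(profile)
    return top + sum(k * h for k, h in enumerate(profile)) / total


def centroid_spacings(centroids):
    """Spacings t_i = y_{i+1} - y_i between adjacent centroids."""
    return [b - a for a, b in zip(centroids, centroids[1:])]


# Three toy text lines: (profile, first scan line).
lines = [([1, 2, 1], 10), ([2, 2], 22), ([1, 1, 1, 1], 34)]
ys = [centroid(p, top) for p, top in lines]
print(ys)                       # [11.0, 22.5, 35.5]
print(centroid_spacings(ys))    # [11.5, 13.0]
```

Because each centroid is an average, the spacings are real valued even though scan-line indices are integers.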

In Figure 15, the $i$th line spacing shift $c_i$ is positive if extra space has been added, negative if space has
been subtracted, and zero otherwise. The printer noise, $v_i$, models the cumulative effect (on the $i$th centroid
spacing) of distortions introduced by printing, scanning and image processing. Making the $j$th copy adds a
random noise $N_i^j$ to the $i$th centroid spacing. At the decoder input after the $K$th copy, the original centroid
spacing $t_i + c_i$ has been distorted to be $s_i^K$. Since the physical processes of printing, scanning, and image
processing are independent of copying, it is assumed that the random variables $v_i$, $i = 1, \ldots, n$, are
independent of $N_i^j$, $i = 1, \ldots, n$, $j = 1, \ldots, K$.

Let a page of $n+1$ text lines be described by the centroid spacings $t_1, \ldots, t_n$. It is assumed that the
printer noise distorts these spacings to

$s_i = t_i + c_i + v_i, \quad i = 1, \ldots, n$ (4.1)

where $v_i$, $i = 1, \ldots, n$, are independent and identically distributed Gaussian random variables. This
assumption is supported by the measurements, which yield a mean of $\mu_1 = 0.0528$ pixel and variance of
$\delta_1^2 = 0.140$ pixel$^2$.

Next, consider the effect of noise introduced by copying. Consider the 0th copy of a page of $n+1$ text
lines with centroid spacings $s_1, \ldots, s_n$. Let the first copy of the page be described by centroid spacings
$s_1^1, \ldots, s_n^1$, where

$s_i^1 = s_i + N_i^1, \quad i = 1, \ldots, n.$ (4.2)

Here, $N_i^1$ is the random noise that summarizes the cumulative effect of skewing, scaling, and other
photographic distortions in the copying process on the $i$th centroid spacing $s_i$. After the $j$th copy, $j \geq 1$, the
centroid spacings are denoted by $s_1^j, \ldots, s_n^j$. As in (4.2), these centroid spacings are given by

$s_i^j = s_i^{j-1} + N_i^j, \quad i = 1, \ldots, n,$ (4.3)

where $N_i^j$ is the noise introduced by copying the $(j-1)$st copy.

Hence, the centroid spacing $s_i^j$ is corrupted by the total noise:

$s_i^j = s_i + (N_i^1 + \cdots + N_i^j).$ (4.4)

The measurements taken suggest a surprisingly simple statistical behavior for the random copier noise. The
noise components $N_i^j$, $j = 1, \ldots, K$, are well modeled by Gaussian random variables with mean $\mu = 0.066$
pixel and variance $\delta^2 = 0.017$ pixel$^2$. The measurements suggest that the random variables $N_i^1, \ldots, N_i^j$ are
also uncorrelated, and by normality, they are thus independent. Hence, the centroid spacing $s_i^j$ on the $j$th
copy is

$s_i^j = s_i + \eta_i^j, \quad i = 1, \ldots, n,$ (4.5)

where $\eta_i^j$ is Gaussian with mean $j\mu$ and variance $j\delta^2$.
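
The accumulation of copier noise can be checked with a small Monte Carlo sketch. It draws independent Gaussian per-copy noise terms with the mean and variance quoted above and sums them over 10 copies; the sample size, the seed, and the function name are arbitrary choices made for this illustration.

```python
import random
from statistics import mean, pvariance

MU = 0.066     # per-copy noise mean, pixels (value quoted in the text)
VAR = 0.017    # per-copy noise variance, pixels^2 (value quoted in the text)


def cumulative_noise(j, rng):
    """Sum of j independent per-copy noise terms N^1 + ... + N^j."""
    sd = VAR ** 0.5
    return sum(rng.gauss(MU, sd) for _ in range(j))


rng = random.Random(0)                  # fixed seed so the run is repeatable
j = 10
samples = [cumulative_noise(j, rng) for _ in range(20000)]

# Under the model, the sample mean should be near j*MU = 0.66
# and the sample variance near j*VAR = 0.17.
print(round(mean(samples), 3), round(pvariance(samples), 3))
```

Both the mean and the variance grow linearly with the number of copies, which is what makes the error probability easy to evaluate for any copy generation.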

Printer noise and copier noise are now combined to estimate the error probability under centroid
detection. Consider three adjacent, differentially encoded text lines labeled such that lines $i-1$ and $i+1$
are unshifted while line $i$ is shifted (up or down) by $c$ pixels. Let $t_{i-1}$ and $t_i$ be the centroid spacings
between these lines in the original unspaced document, and let $s_{i-1}$ and $s_i$ be the corresponding spacings
on the 0th copy of the encoded document. Then

$s_{i-1} = t_{i-1} + c + v_{i-1}$ (4.6)

$s_i = t_i - c + v_i$ (4.7)

where $c = +1$ if line $i$ is shifted down and $c = -1$ if line $i$ is shifted up. Let $s_{i-1}^j$ and $s_i^j$ be the
corresponding centroid spacings on the $j$th copy of the document. Then

$s_{i-1}^j = t_{i-1} + c + v_{i-1} + \eta_{i-1}^j$ (4.8)

$s_i^j = t_i - c + v_i + \eta_i^j$ (4.9)

where $\eta_{i-1}^j$ and $\eta_i^j$ are defined in (4.5).

Suppose the $j$th copy of the document is recovered and is to be decoded. Applying the above (4.8 and
4.9) to the detection rule (3.3):

if $v_{i-1} - v_i > \eta_i^j - \eta_{i-1}^j - 2c$ : decide line shifted down (4.10)
otherwise : decide line shifted up

Since the random variables $v_{i-1}$, $v_i$, $\eta_{i-1}^j$ and $\eta_i^j$ are mutually independent, the decision variable

$D = (v_{i-1} - v_i) - (\eta_i^j - \eta_{i-1}^j)$

is Gaussian with zero mean and variance $2(\delta_1^2 + j\delta^2)$. Hence, the probability that a given line is decoded in
error is

$p(D > -2c \mid \text{up shift}) = p(D \leq -2c \mid \text{down shift}) = p(D \geq 2).$ (4.11)

The error probability is easily evaluated using the complementary error function. Using the measurements
$\delta_1^2 = 0.140$ and $\delta^2 = 0.017$, the error probability is only approximately 2% on the 20th copy.
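
The figure quoted for the 20th copy can be reproduced with the short sketch below, which evaluates the Gaussian tail probability through the complementary error function. The function name and the `shift` parameter (a generalization to shifts larger than 1 pixel) are assumptions added for illustration; the two variances are the measured values given in the text.

```python
import math

VAR_PRINTER = 0.140   # printer noise variance, pixels^2 (from the text)
VAR_COPIER = 0.017    # per-copy copier noise variance, pixels^2 (from the text)


def error_probability(j, shift=1.0):
    """Probability that a zero-mean Gaussian D exceeds 2*shift on the j-th copy.

    D has variance 2 * (VAR_PRINTER + j * VAR_COPIER), as in the model above.
    """
    variance = 2.0 * (VAR_PRINTER + j * VAR_COPIER)
    return 0.5 * math.erfc(2.0 * shift / math.sqrt(2.0 * variance))


for j in (0, 10, 20):
    print(j, error_probability(j))
```

For j = 20 the variance is 2(0.140 + 0.34) = 0.96, so the threshold of 2 pixels lies about 2.04 standard deviations from the mean, giving a tail probability of roughly 0.02.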

5.2 Comparison of Baseline and Centroid Detection Algorithms 

Detection using either the baseline or centroid of a text line profile offers distinct advantages and
disadvantages. As expected, the experimental results reveal that centroid-based detection outperforms
baseline-based detection for pages encoded with small line shifts (i.e. 1 pixel) and subject to large
distortion. This performance difference arises largely because baseline locations are integer valued, while
centroid locations, being averages, are real valued. Recall that baseline locations are determined by
detection of a peak in the text line profile. Sometimes this peak is not pronounced: the profile values on scan
lines neighboring the baseline are often near the peak value. Hence, relatively little noise can cause the
peak to shift to a neighboring scan line. A single scan line shift is sufficient to introduce a detection error
when text lines are encoded with a 1 pixel shift.

It also appears likely that centroids are less subject to certain imaging defects than are baselines.
Baselines appear relatively vulnerable to line skew (or more precisely, the noise introduced by deskewing).

Though centroid detection outperforms baseline detection, the latter has other benefits. In particular,
encoded documents can be decoded without reference to the original, unspaced document. A secure
document distributor would then be relieved of the need to maintain a library of original document centroid
spacings for decoding.

Finally, both detection techniques can be used jointly (and indeed, with other techniques) to provide a
particularly robust, low error probability detection scheme.

6.0 Conclusion 

Making and distributing illegitimate copies of documents can be discouraged if each of the original
copies is unique, and can be associated with a particular recipient. Several techniques for making text
documents unique have been described. One of these techniques, based on text line shifting, has been
implemented as a set of experiments to demonstrate that perturbations in line spacing that are small
enough to be indiscernible to a casual reader can be recovered from a paper copy of the document, even
after being copied several times.

In the experiments, the position of the odd numbered lines within each paragraph remains the same
while the even numbered lines are moved up or down by a small amount. By selecting different line shifts,
information is encoded into the document. If the document remains in the electronic form throughout the
experiment, retrieving the encoded information is trivial. To retrieve the information from a paper copy, the
document is scanned back into the computer. Two detection methods have been considered, one based on
the location of the bottom of the characters on each line, and the other based on the center of mass of each
line. The advantage of using the baselines is that they are equally spaced before encoding and the
information can be retrieved without reference to a template. The centers of mass of the lines are not
equally spaced; however, this technique has been found to be more resilient to the types of distortion
encountered in the printing and copying process.

The differential encoding mechanism has been selected because the types of distortion that have been
encountered have canceled out when differences between adjacent lines are considered. In the experiments,
the lines in the document are moved up or down by as little as 1/300 inch, the document is copied
as many as ten times, then the document is scanned into a computer and decoded. For the set of
experiments that have been conducted, the centroid decoding mechanism has provided an immeasurably
small error rate.

Obviously, numerous modifications and variations of the present invention are possible in light of the 
above teachings. It is therefore understood that within the scope of the appended claims, the invention may 
be practiced otherwise than as specifically described herein. 

Claims 

1. A method of deterring the illicit copying of electronically published documents, which comprises:

(a) utilizing a computer system to electronically publish a plurality of copies of a document having
electronically created material thereon for distribution to a plurality of subscribers;

(b) operating programming within said computer system so as to perform the following steps:

(i) encoding said plurality of copies each with a separate, unique identification code, said
identification code being based on a unique arrangement of the electronically created material on
each such copy; and,

(ii) creating a codebook to correlate each such identification code to a particular subscriber.

2. The method of claim 1 wherein said printed material is lines of textual material and said unique 
arrangement of the electronically created material is based on line-shift coding. 

3. The method of claim 2 wherein said line-shift coding is accomplished by altering a document format file
to shift locations of at least one line relative to other lines contained in the document to uniquely
encode each copy of the document.

4. The method of claim 2 wherein said line-shift coding is accomplished by altering a document bitmap
image to shift locations of at least one line relative to other lines contained in the document to uniquely
encode each copy of the document.

5. The method of claim 1 wherein said electronically created material includes words arranged in a
predetermined sequence and said unique arrangement of the material is based on word-shift coding.

6. The method of claim 5 wherein said word-shift coding is accomplished by altering a document format
file to shift locations of at least one word relative to other words contained in the document to uniquely
encode each copy of the document.

7. The method of claim 5 wherein said word-shift coding is accomplished by altering a document bitmap
image to shift locations of at least one word relative to other words contained in the document to
uniquely encode each copy of the document.

8. The method of claim 1 wherein said electronically created material includes standardized print features 
and said unique arrangement of the material is based on feature-altered coding. 

9. The method of claim 8 wherein said feature-altered coding is accomplished by altering a document
bitmap image to alter at least one print feature relative to said standardized print features.

10. The method of claim 1, wherein operating said programming also includes performing the steps of:

(iii) creating a first copy of said document as a standard document;

(iv) creating a plurality of subsequent copies, each with at least one alteration rendering it different
from said standard document, and each being different from one another so that each of said copies
has a unique identification code based on said at least one alteration;

(v) comparing each subsequent copy with said standard document to identify a sequence of same
and different aspects of each such copy relative to said standard document; and,

(vi) converting said sequence of same and different aspects to a unique binary identification code.

11. The method of claim 1, wherein said computer system includes a scanner device connected thereto
and operating said programming includes performing the steps of:

(1) scanning a copy of a document to feed its image into said computer system;

(2) analyzing and decoding the image to determine its unique identification code; and,

(3) comparing the resulting identification code to the codebook to determine the particular subscriber
to which the identification code correlates.

12. The method of claim 1 wherein operating said programming includes performing the steps of:

(1) receiving a bitmap image for a copy of a previously electronically published document;

(2) analyzing the bitmap image to determine its unique identification code; and,

(3) comparing the resulting identification code to the codebook to determine the particular subscriber
to which the identification code correlates.

13. The method of claim 1, wherein operating said programming includes performing the steps of:

(1) receiving a document format file for a copy of a previously electronically published document;

(2) analyzing the document format file to determine its unique identification code; and

(3) comparing the resulting identification code to the codebook to determine the particular subscriber
to which the identification code correlates.

14. The method of claim 11 wherein operating said programming also includes the step of noise reduction.

15. The method of claim 14 wherein operating said programming also includes the step of noise reduction.

16. The method of claim 14 wherein said analyzing and decoding is based on baseline differential
determinations, or on centroid differential determinations.

17. The method of claim 11, 12 or 13 wherein said copy of a document is from a document which has been
encoded by word-shift alteration, or by feature enhancement alteration, or by line-shift alteration.

[Drawings: FIG. 1, FIG. 4, FIG. 6, FIG. 7, FIG. 10, FIG. 13, FIG. 15]